DOCUMENT RESUME

ED 103 450          95          TM 004 291

AUTHOR         Kosecoff, Jacqueline B.; Klein, Stephen P.
TITLE          Instructional Sensitivity Statistics Appropriate for
               Objectives-Based Test Items. CSE Report No. 91.
INSTITUTION    California Univ., Los Angeles. Center for the Study
               of Evaluation.
SPONS AGENCY   National Inst. of Education (DHEW), Washington,
               D.C.
REPORT NO      CSE-R-91
PUB DATE       Apr 74
NOTE           30p.; Paper presented at the Annual Meeting of the
               National Council on Measurement in Education
               (Chicago, Illinois, April 1974)
EDRS PRICE     MF-$0.76 HC-$1.95 PLUS POSTAGE
DESCRIPTORS    *Academic Achievement; Correlation; *Criterion
               Referenced Tests; *Evaluation Methods; Guessing
               (Tests); Instruction; *Item Analysis; Post Testing;
               Pretesting; Response Style (Tests); *Statistical
               Analysis; Test Construction; Testing; Test
               Validity
IDENTIFIERS    External Sensitivity Index; Internal Sensitivity
               Index; *Item Sensitivity



ABSTRACT 

Two types of sensitivity indices were developed in
this paper, one internal to the total test and the second external.
To evaluate the success of these statistics, the three criteria
suggested for a satisfactory index of item quality were considered.
The Internal Sensitivity Index (ISI) appears to meet these demands.
It is easily computed. In addition, its moderately positive
correlations with other traditional statistics confirm that the ISI
provides unique information and yet is not inconsistent with these
indices. However, when there are a large number of masters at the
pretest, an alternate form of the ISI is sometimes necessary to
demonstrate item sensitivity. Finally, the theoretical construction
of the ISI is both intuitively understandable and similar in form to
other statistics. The External Sensitivity Index (ESI), on the other
hand, does not fare as well as its internal counterpart. Although
computationally simple, it fails to demonstrate any consistent
correlations with the traditional indices, suggesting a rather random
statistic. Perhaps a single item is not sufficient to provide a
stable, reliable measure of the effects of instruction. The ISI
appears to provide a suitable measure of an item's ability to
distinguish between those who have and have not benefited from
instruction. Further, the most appropriate approach for evaluating
item quality is an examination of the item in context with total test
performance. (Author/BJG)



BEST COPY AVAILABLE



INSTRUCTIONAL SENSITIVITY STATISTICS APPROPRIATE 
FOR OBJECTIVES-BASED TEST ITEMS* 



Jacqueline B. Kosecoff 
and 

Stephen P. Klein 



NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING IT.
POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT
OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY.



CSE Report No. 91 
April 1974 






Program for Research on Objectives-Based Evaluation 
Center for the Study of Evaluation 
UCLA Graduate School of Education 
Los Angeles, California 



*Paper presented at the annual meeting of the National Council on 
Measurement in Education, Chicago, April, 1974. 






Center for the
Study of Evaluation




MARVIN C. ALKIN 
DIRECTOR 



UCLA Graduate School of Education 



The CENTER FOR THE STUDY OF EVALUATION is one of eight educa-
tional research and development centers sponsored by the U.S. Office of
Education, Department of Health, Education and Welfare. Established at
UCLA in 1966, under provisions of the Cooperative Research Act, CSE is
devoted exclusively to the area of evaluation.

The mission of the Center is to conduct research and development activi-
ties for the production of new materials, practices and knowledge leading to
the development of systems for evaluating education which can be adopted
and implemented by educational agencies. The scope of activity includes
the development of procedures and methodologies needed in the practical
conduct of evaluation studies of various types, and the development of gener-
alizable theories and concepts of evaluation relevant to different levels of
education.

This publication is one of many produced by the Center toward its goals.
Information on CSE and its publications may be obtained by writing to:

Dissemination Office
Center for the Study of Evaluation
UCLA Graduate School of Education
Los Angeles, California 90024

*The Center is presently funded by the National Institute of Education.




One of the typical steps in evaluating the quality of test items involves
examining the degree to which student performance on the item is related to
student performance on the total test. The basic assumption underlying this
internal consistency approach to assessing item quality is that the total test
score is the best available criterion of the degree to which students have
mastered the content for which the test was designed. Thus, an item is con-
sidered "good" if it discriminates between high and low achievers in essen-
tially the same way as does the total test.

Internal consistency indices of item discrimination, such as the point
biserial correlation coefficient obtained between item and total test scores,
have been used extensively in the construction of tests designed to make com-
parisons among students. Such indices are not maximally appropriate, however,
for assessing item quality on measures designed to evaluate the effects of
educational programs, since discrimination indices are not uniquely sensitive
to the effect of instruction. In other words, typical item discrimination
indices are so often influenced by a number of factors affecting test scores,
such as general intellectual ability, that they mask whether the item
truly discriminates between those who have versus those who have not profited
from the effects of instruction. This situation has given rise to a number
of item sensitivity indices; that is, indices that reflect an item's sensi-
tivity to instruction.
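
For readers who want a concrete reference point for the traditional index named above, the following is a minimal Python sketch of a point biserial correlation between one item and the total test score. The function name, the input layout (parallel lists of 0/1 item scores and total scores), and the use of the population standard deviation are illustrative assumptions, not details taken from this paper.

```python
import math


def point_biserial(item_scores, total_scores):
    """Point biserial correlation between a 0/1 item and total test scores.

    Hypothetical helper: item_scores and total_scores are parallel lists,
    one entry per student. Uses the population standard deviation.
    """
    n = len(item_scores)
    if n == 0 or n != len(total_scores):
        raise ValueError("need two non-empty lists of equal length")
    p = sum(item_scores) / n          # proportion answering the item correctly
    q = 1.0 - p
    mean = sum(total_scores) / n
    sd = math.sqrt(sum((t - mean) ** 2 for t in total_scores) / n)
    if p == 0.0 or q == 0.0 or sd == 0.0:
        raise ValueError("correlation undefined without variation")
    # Mean total score of students who passed and who failed the item.
    mean_pass = sum(t for i, t in zip(item_scores, total_scores) if i) / (p * n)
    mean_fail = sum(t for i, t in zip(item_scores, total_scores) if not i) / (q * n)
    return (mean_pass - mean_fail) / sd * math.sqrt(p * q)


# Made-up data: students who pass the item also have higher totals.
r_pb = point_biserial([0, 0, 1, 1], [1, 2, 3, 4])
print(round(r_pb, 3))
```

A high value here says only that the item ranks students the way the total score does; as the paragraph above argues, it does not by itself show that the item responds to instruction.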

This paper describes several current attempts to provide some useful in-
dices by which a test developer could judge the adequacy of his/her test items
in terms of the extent to which the items reflect instruction. In addition,
two new sensitivity indices are proposed, and the characteristics of these
indices are compared to one another and to the traditional discrimination
statistics.


It should be noted that efforts towards estimating item sensitivity have
been associated almost exclusively with criterion-referenced testing situations
(as compared to norm-referenced). This does not mean that item sensitivity
indices are limited to situations where a test is to be interpreted using a
criterion-referenced metric, but rather that criterion-referenced tests are
thought to be more appropriate for the evaluation of instructional programs.
Sensitivity indices should be associated with the question, "Can this item
discriminate between learners and non-learners?" and not with whether the
test is intended for criterion- or norm-referenced interpretation.

CURRENT SENSITIVITY INDICES

Cox and Vargas (1966) proposed a pretest-posttest difference sensitivity
index that was obtained by computing "the percentage of students who pass the
item on the posttest minus the percentage who pass the item on the pretest."
Similar to the notion of raw gain, this index measures the percentage of
students who had not mastered the item before instruction (at the pretest)
but who had mastered the item after instruction (at the posttest). This in-
dex does not attend to how the item behaves with respect to the total test,
to whether the item can discriminate between a group of students who actually
learned and those who did not, or to corrections for guessing. Cox and Var-
gas correlated their index with the traditional discrimination index (top
27% - bottom 27%) and found rather low correlation coefficients, suggesting
fundamental differences between these indices.
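
The difference index quoted above is simple enough to sketch directly. The Python below is an illustration under stated assumptions: the function name and the input format (two parallel lists of 0/1 item scores for the same students) are hypothetical, and the data are made up.

```python
def difference_index(pre_item, post_item):
    """Percent passing the item on the posttest minus percent passing on the pretest.

    Hypothetical helper: pre_item and post_item are parallel lists of 0/1
    scores on one item for the same students.
    """
    if not pre_item or len(pre_item) != len(post_item):
        raise ValueError("need two non-empty lists of equal length")
    n = len(pre_item)
    pct_pre = 100.0 * sum(pre_item) / n
    pct_post = 100.0 * sum(post_item) / n
    return pct_post - pct_pre


# Made-up data: 3 of 10 students pass before instruction, 8 of 10 after.
pre = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
post = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(difference_index(pre, post))  # 50.0
```

Because only the two marginal percentages enter the calculation, the sketch makes the limitation noted in the text visible: nothing about total test performance or guessing is used.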

Popham (1970) experimented with measuring changes which occur in items
over an instructional period. He identified four possibilities: for any
given learner, an item could be answered incorrectly on both the pre- and
posttests (FF), correctly on the pretest but incorrectly on the posttest (PF),
incorrectly on the pretest but correctly on the posttest (FP), and correctly
on both the pre- and posttests (PP). A situation characterized by a high
percentage of FP's was considered one reflecting learning, whereas a high per-
centage of PF's indicated negative learning.

Two statistics were explored. First, for each item the percentage of
students responding in each of the four ways was tabulated and items were
ranked twice, according to highest percentage in the FP and PF categories
respectively. When the two sets of rankings were compared, a negative cor-
relation coefficient was obtained, suggesting a trend towards learning.
Using this approach, an item is viewed as external to or independent of the
total test. A second statistic, however, considered an item's homogeneity
with the total test (i.e., an internal index). A 4 x k Chi-square test (where
4 refers to the FF, PF, FP and PP categories and k to the number of items
measuring the same objective) was conducted to measure the degree to which
these items performed similarly with respect to the four possible response
patterns. A non-significant test would indicate that all items performed
similarly and thus reflected the effects of instruction in the same way.
After field testing these statistics, Popham concluded that neither stat-
istic represented an appropriate "red flag" for identifying items that fail
to discriminate among learners.
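
The tabulation and the 4 x k comparison described above can be sketched as follows. This is an illustration only, not Popham's procedure as published: the function names, the 0/1 input format, and the sample data are assumptions, and the statistic is the ordinary Pearson chi-square, which treats the k items as independent samples even though the same students answer every item.

```python
from collections import Counter

PATTERNS = ("FF", "PF", "FP", "PP")  # first letter = pretest, second = posttest


def response_patterns(pre_item, post_item):
    """Count the four pre/post response patterns for one item (0/1 scores)."""
    labels = {(0, 0): "FF", (1, 0): "PF", (0, 1): "FP", (1, 1): "PP"}
    counts = Counter(labels[(int(a), int(b))] for a, b in zip(pre_item, post_item))
    return [counts[name] for name in PATTERNS]


def chi_square(table):
    """Pearson chi-square and nominal degrees of freedom for a k x 4 table.

    table: one row of four pattern counts per item. Cells with an expected
    count of zero are skipped; df is not adjusted for empty columns.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


# Made-up responses of five students to two items measuring one objective.
item_a = response_patterns([0, 0, 0, 1, 1], [0, 1, 1, 0, 1])
item_b = response_patterns([0, 0, 0, 0, 1], [1, 1, 1, 1, 1])
stat, df = chi_square([item_a, item_b])
print(item_a, item_b, round(stat, 3), df)
```

With realistic sample sizes the statistic would be referred to a chi-square distribution with the returned degrees of freedom; a small value corresponds to the "items performed similarly" outcome described in the text.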

Both Cox and Vargas' difference index and Popham's Chi-square stat-
istic were employed by Ozenne (1971) to initially select items for a cri-
terion-referenced measure. Ozenne's major focus, however, was not on the
instructional sensitivity of a single item but on the total test. Using
analysis of variance techniques, Ozenne proposed a model that accounted for
the variability of subject responses under a variety of criterion-referenced
test situations and, in turn, led to an estimate of the total test's sensi-
tivity to instruction. Using a true experiment (instruction versus no instruc-
tion), Ozenne's work provides the most sensitive test of the effects of instruc-
tion. In this paper the major concern is not to validate the total test nor
to measure the impact of an instructional sequence, but rather to approximate
an item's sensitivity to instruction.

Roudabush (1973) has suggested still another sensitivity index that con-
siders the item response patterns used by Popham but also provides a correc-
tion for guessing. This model borrows from a procedure described by Marks
and Noll (1967) developed for use in a slightly different context. This model
is used again to develop new indices in another section of this paper and is
presented there in some detail.

In terms of Popham's response categories, Roudabush defines an item sen-
sitivity index as

$s = \frac{FP'}{FP' + FF'}$

where the prime denotes that the percentage of responses falling into each cate-
gory has been corrected for guessing. That is, $FF'$ represents the "true"
percentage of learners who did not master the item and $FP'$ the "true" percen-
tage of learners who did not know the item at the pretest but mastered it by
the posttest. This index measures the proportion of students that missed the
item on the pretest and then correctly responded to it on the posttest after
a correction for guessing is applied; it does not, however, measure the "gain"
in learners from the pretest to the posttest. Once again, s is computed in-
dependently of the total test score and thus serves as an external (to the
test) or test-independent sensitivity index. This procedure was applied to a
criterion-referenced reading test. Roudabush (1973) concluded, "Using sensi-
tivity to instruction as the major criterion for item selection leads to
choosing a different set of items than would ordinarily be chosen (p. 11)"
(using the traditional indices).

To summarize, current efforts have focused on comparing an item's response
pattern prior to and post instruction. In most cases an item is considered in-
dependent of the total test and the resulting statistic can be described as an
external sensitivity index. Field testing of these indices has not yet led to
a single index that reflects the effects of instruction. However, one consis-
tent result has emerged: sensitivity indices tend to select different items
than their traditional counterparts.

NEW SENSITIVITY INDICES

In this section two sensitivity indices will be developed. The first
statistic, an internal sensitivity index (ISI), measures an item's performance
within the context of the total test, comparing how a given item and the en-
tire test discriminate among learners. The second statistic, an external
sensitivity index (ESI), measures an item independently of the total test,
providing an estimate of an individual item's ability to assess learning.
A correction for guessing (used as well by Roudabush) is provided for the
ESI. The development of these statistics was guided by three criteria. An
item sensitivity statistic must:

a. optimize ease in computation,

b. provide unique information, and

c. be relatively consistent with other general indices of item quality.

Internal Sensitivity Index (ISI)

Consider the pattern for pre-posttest performance among students who
correctly responded to item i as depicted in Table 1. The total sample
($n_1 + n_2 + n_3 + n_4 = N_i$) represents the number of students who passed item i
at the posttest. The number of scores falling into cell (1,1) reflects the
frequency of students failing both the pretest and the posttest among those
who correctly responded to item i. This is an undesirable outcome since item
i failed to identify a non-learning situation: students have remained
non-masters after instruction and yet they correctly responded to item i on
the posttest. Scores falling into cell (1,2), on the other hand, suggest a
more desirable outcome. In this case students who correctly responded to
item i on the posttest were non-masters before instruction and have reached
mastery by the posttest. Cells (2,1) and (2,2) are situations in which stu-
dents were already masters prior to instruction. In cell (2,1) students who
had previously mastered the material based on a pretest responded correctly
to item i on the posttest but failed the total posttest, indicating non-mastery
or negative learning. Hopefully such situations are rare, particularly when
pre- and posttests are close together in time, providing little opportunity for
forgetting. Finally, scores falling into cell (2,2) suggest that students who
correctly answered item i on the posttest were able to demonstrate mastery both
prior to and following instruction. Although this pattern is not undesirable
in terms of item i's sensitivity, teaching already acquired skills is certainly
questionable.



Insert Table 1 about here 



TABLE 1

Distribution of Students Responding Correctly to Item i
in Terms of Pre- and Posttest Performance

                 Fail Posttest      Pass Posttest      Marginals

Fail Pretest     $n_1$  (1,1)       $n_2$  (1,2)       $n_1+n_2$

Pass Pretest     $n_3$  (2,1)       $n_4$  (2,2)       $n_3+n_4$

Marginals        $n_1+n_3$          $n_2+n_4$          $N_i$

where

$n_1$ = observed frequency of students who answered item i correctly on the
        posttest but failed the pre- and posttest

$n_2$ = observed frequency of students who answered item i correctly on the
        posttest but failed the pretest and passed the posttest

$n_3$ = observed frequency of students who answered item i correctly on the
        posttest but passed the pretest and failed the posttest

$n_4$ = observed frequency of students who answered item i correctly on the
        posttest and passed the pre- and posttests

To investigate the effects of instruction we need only study those stu-
dents who fail the pretest (i.e., the students who are non-masters with re-
spect to the total test prior to instruction).* With respect to posttest
scores, the proportion of students correctly responding to item i who failed
the pretest but passed the posttest minus the proportion of students giving
the correct response to item i who fail both the pre- and posttests provides
a measure of an item's sensitivity to instruction. That is, a sensitivity
index should discriminate among students (correctly answering item i) who were
non-masters before instruction and masters after instruction. In formula no-
tation this statistic can be expressed as:

(1)  $ISI = \frac{n_2 - n_1}{n_1 + n_2 + n_3 + n_4} = \frac{n_2 - n_1}{N_i}$

*Note that this model assumes that a definition of mastery can be estab-
lished. Some guidelines for mastery testing are put forth by Harris (1974).
In this same paper Harris also sets a precedent for considering selected por-
tions of the data (as we do later in an alternate ISI).

If a passing score on the test is equated with mastery of the associated
instructional objectives, then the ISI provides a measure of an item's ability
to discriminate between those who have and have not profited from instruction.
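
As a worked illustration of the ISI, the short Python sketch below computes the index from the four cell counts of Table 1. The function name is hypothetical, and the formula is the one implied by the verbal definition and the stated range of -1 to +1: the count in cell (1,2) minus the count in cell (1,1), divided by the number of students who answered item i correctly on the posttest. The example counts are those listed for item 1 of the 7-item test in Table 3.

```python
def isi(n1, n2, n3, n4):
    """Internal Sensitivity Index from the four Table 1 cell counts.

    n1..n4 count only students who answered item i correctly on the posttest,
    classified by pass/fail on the total pretest and total posttest.
    """
    n_i = n1 + n2 + n3 + n4
    if n_i == 0:
        raise ValueError("no student answered the item correctly on the posttest")
    return (n2 - n1) / n_i


# Item 1 of the 7-item test (Table 3): cells (1,1)=1, (1,2)=11, (2,1)=0, (2,2)=19.
print(round(isi(1, 11, 0, 19), 3))  # 0.323

# Boundary cases described in the text.
print(isi(5, 0, 0, 0), isi(0, 5, 0, 0))  # -1.0 1.0
```

Note how the 19 students in cell (2,2) enlarge the denominator without contributing to the numerator, which is why many pretest masters pull the index toward zero.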

The possible scores on the ISI range from -1 to +1. A score of -1 occurs
when all students fail both the pre- and posttests but correctly respond to
item i on the posttest. Certainly such an item is not sensitive to instruc-
tion and does not discriminate between masters and non-masters in a desirable
fashion. On the other hand, a score of +1 is obtained when all students who
properly answer item i on the posttest fail the pretest but pass the posttest.
This is the ideal situation; item i can discriminate between students who are
non-masters prior to instruction and masters after instruction.* Any scores
in cells (2,1) and (2,2) (i.e., $n_3$ and/or $n_4 \neq 0$) will force the ISI to be
less than 1 in absolute value. This is also a desirable property, as students
falling into these categories are by definition masters prior to instruction
and therefore should be directed to other instructional activities (rather
than repeating already mastered materials).

*It should be noted that the satisfaction derived from an index value of
+1 is directly related to $N_i$ (the number of students who passed item i on the
posttest). It is possible that only one student passes item i at the posttest
($N_i = 1$) and that he (she) was a non-master at pretest and a master at post-
test. In such a case ISI = +1, but in view of the value of $N_i$, there is
little cause for celebration.
External Sensitivity Index (ESI)

The ESI attends to item quality from a test-independent point of view.
Once again let us turn to the 4 possible categories of response to item i across
pre- and posttest. The model for this approach, depicted in Table 2, closely
resembles that for the ISI; however, like Roudabush and Popham, we now con-
sider the responses to item i on pre- and posttest independent of total test
performance. The total sample ($n_1+n_2+n_3+n_4=N$)* now represents all learners
tested, and the scores falling into cell (1,2), for example, reflect the fre-
quency of students who miss item i on the pretest but pass item i on the
posttest.

Insert Table 2 about here

TABLE 2

Response Patterns for Students Responding
to Item i Across Pre- and Posttests

                      Fail i on Posttest   Pass i on Posttest   Marginals

Fail i on Pretest     $n_1$  (1,1)         $n_2$  (1,2)         $n_1+n_2$

Pass i on Pretest     $n_3$  (2,1)         $n_4$  (2,2)         $n_3+n_4$

Marginals             $n_1+n_3$            $n_2+n_4$            $N$

$n_1$ = observed frequency of students who missed item i on the pretest and the
        posttest

$n_2$ = observed frequency of students who missed item i on the pretest but
        responded correctly on the posttest

$n_3$ = observed frequency of students who responded correctly to item i on the
        pretest but missed it on the posttest

$n_4$ = observed frequency of students who answered item i correctly on the pre-
        and posttest

The derivation of the ESI is analogous to that of the ISI. Once again we are
only concerned with students who were non-masters (in terms of item i) at the
pretest, that is, those students falling into cells (1,1) and (1,2).

The proportion of students who were non-masters at the pretest but mas-
ters on the posttest minus the proportion of students who were non-masters
at the pretest and remained non-masters at the posttest provides a second,
test-independent measure of an item's sensitivity. In formula notation this
statistic can be expressed as:

(2)  $ESI = \frac{n_2 - n_1}{n_1 + n_2 + n_3 + n_4} = \frac{n_2 - n_1}{N}$

*Note that in this model $N$ = the total number of students tested, while
in the model for the ISI, the denominator $N_i$ = the number of students
passing item i on the posttest.

Comparing the formulas for the ISI and ESI, it is clear that these in-
dices do not differ in computational form; however, each utilizes different
types of frequencies (i.e., different definitions for $n_1$, $n_2$, $n_3$ and $n_4$) and
consequently provides different kinds of information. The ISI measures item
quality from the perspective of the total test's discriminating power while
the ESI offers an individual estimate of how an item reflects learning.

Like the ISI, the values of the ESI can range from -1 to +1. A score of
-1 would occur when no one learned; that is, each student failed item i on
both the pretest and the posttest. Such a result suggests that either instruc-
tion failed to benefit any of the students or, more realistically, that the item
fails to discriminate among learners. A score of +1, on the other hand, is
obtained when all students fail item i on the pretest but pass item i on the
posttest. This is the ideal situation; item i shows maximum change in the
direction of learning. Finally, any scores in cells (2,1) and (2,2) (i.e.,
$n_3$ and/or $n_4 \neq 0$) will lower the absolute value of the ESI.
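
To make the contrast with the ISI concrete, the sketch below builds the Table 2 cells from item-level pre and post scores and then computes the uncorrected ESI. Function names and the 0/1 input format are illustrative assumptions; the example counts are those listed for item 2 of the 7-item test in Table 4 (9, 28, 3, 10), expanded into individual responses only for demonstration.

```python
def item_cells(pre_item, post_item):
    """Return (n1, n2, n3, n4) for Table 2 from parallel 0/1 item scores."""
    n1 = n2 = n3 = n4 = 0
    for before, after in zip(pre_item, post_item):
        if not before and not after:
            n1 += 1  # missed item on pretest and posttest
        elif not before and after:
            n2 += 1  # missed on pretest, correct on posttest
        elif before and not after:
            n3 += 1  # correct on pretest, missed on posttest
        else:
            n4 += 1  # correct both times
    return n1, n2, n3, n4


def esi(n1, n2, n3, n4):
    """Uncorrected External Sensitivity Index: (n2 - n1) / N."""
    n = n1 + n2 + n3 + n4
    if n == 0:
        raise ValueError("no students tested")
    return (n2 - n1) / n


# Item 2 of the 7-item test (Table 4), rebuilt as 50 individual responses.
pre = [0] * 9 + [0] * 28 + [1] * 3 + [1] * 10
post = [0] * 9 + [1] * 28 + [0] * 3 + [1] * 10
cells = item_cells(pre, post)
print(cells, esi(*cells))  # (9, 28, 3, 10) 0.38
```

Unlike the ISI sketch, every student tested enters the denominator here, and no information about total test performance is used.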

Correction for Guessing for External Sensitivity Index. The ESI can be
further redefined to correct for guessing. Traditionally a predetermined
correction for guessing based on an item's format (e.g., the number of dis-
tractors in a multiple-choice test) is universally applied to all similarly
formatted items in a given test. In this section an alternate formula for
estimating the probability of guessing the correct response for a particular
item is derived based on Marks and Noll.* This correction, based on the fre-
quencies displayed in Table 2 rather than on item format, test length or
other considerations, can assume different values for each item. Using this
correction we can solve for the expected frequencies (or true values) of the
cells in Table 2 and can derive an ESI that reflects any biases due to guessing.
We begin our derivation by making the following assumptions:

*Marks and Noll's correction method was also applied by Roudabush.

a. There is a non-zero probability, p, that a student who does not know
   the answer will guess correctly, where p is derived from observed
   data rather than a predetermined value based on the item's format.

b. Scores are independent from pre- to posttest (e.g., there is no sys-
   tematic bias due to recollection of responses on the pretest).

c. There is no systematic forgetting between the pretest and the posttest,
   and therefore $E(2,1) = E(n_3) = 0$.

When deriving p, the probability of guessing the correct answer, we will
refer to Table 2 and its notation. In addition, the following notation will
be employed:

$v_1$ = the true frequency of cell (1,1)*; the number of students who leg-
        itimately did not know the answer to item i at both pre- and post-
        tests (i.e., students who did not learn)

$v_2$ = the true frequency of cell (1,2); the number of students who leg-
        itimately did not know the answer to item i at the pretest but
        then learned by the posttest (i.e., students who learned)

$v_3$ = the true frequency of cell (2,1); the number of students who leg-
        itimately knew the answer to item i at the pretest but not at the
        posttest (note that according to assumption c, we expect to find
        zero students in this cell, that is, $E(n_3) = v_3 = 0$)

$v_4$ = the true frequency of cell (2,2); the number of students who knew
        the answer to item i at both the pre- and posttests (i.e., students
        who always knew)

*In probabilistic terms, $v_1$ is the expected value of $n_1$.

We are now ready to compute the expected cell frequencies and the value
of p. Consider the cell (1,1). The observed frequency $n_1$ can be entirely
accounted for by those students who did not learn ($v_1$) and guessed wrong on
item i twice (on the pre- and posttests). The probability of guessing cor-
rectly on either occasion is p, and consequently the probability of making a bad
guess is 1-p. Applying the multiplication rule for probability, the
probability of guessing wrong twice is $(1-p)^2$. Therefore, the observed $n_1$ can
be expressed mathematically as:

(3)  $n_1 = (1-p)^2 v_1$

Equations for $n_2$, $n_3$ and $n_4$ can be derived using similar reasoning. The
observed frequency in cell (1,2) can be explained by students who learned but
guessed wrong on item i on the pretest [$(1-p)v_2$] plus students who did not
learn, guessed wrong on the pretest, and guessed successfully on the posttest
[$p(1-p)v_1$]. That is,

(4)  $n_2 = (1-p) v_2 + p(1-p) v_1$

Students falling in cell (2,1) are those who did not learn but guessed
successfully on the pretest and unsuccessfully on the posttest [$p(1-p)v_1$].
Recall that we have assumed that students do not forget during instruction
and consequently that the situation of knowing item i before instruction but
not after instruction is impossible ($v_3 = 0$). Therefore we have

(5)  $n_3 = p(1-p) v_1$

Finally, the observed frequency in cell (2,2) can be accounted for by a
combination of students who always knew ($v_4$), students who learned and guessed
correctly on item i on the pretest ($p v_2$), and students who never learned but
guessed correctly twice ($p^2 v_1$). This yields

(6)  $n_4 = v_4 + p v_2 + p^2 v_1$

From equations (3) and (5) we can solve for p.

(3):  $n_1 = (1-p)^2 v_1$

(5):  $n_3 = p(1-p) v_1$

Dividing (5) by (3) gives $n_3 / n_1 = p / (1-p)$, and therefore

(7)  $p = \frac{n_3}{n_1 + n_3}$

Proceeding in a similar fashion we can use equations (3) through (7) to
find the following expected cell frequencies:

(8)   $v_1 = \frac{(n_1 + n_3)^2}{n_1}$

(9)   $v_2 = \frac{(n_2 - n_3)(n_1 + n_3)}{n_1}$

(10)  $v_3 = 0$

(11)  $v_4 = n_4 - \frac{n_2 n_3}{n_1}$

(12)  $v_1 + v_2 + v_3 + v_4 = n_1 + n_2 + n_3 + n_4$

A corrected external sensitivity index can then be computed:

(13)  $ESI^* = \frac{v_2 - v_1}{N} = \frac{n_1 + n_3}{n_1 N} \left[ (n_2 - n_3) - (n_1 + n_3) \right]$
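
The derivation above can be checked numerically. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, and the formulas are the closed forms obtained by solving the cell equations of this section for p and the true frequencies (p as the cell (2,1) count divided by the sum of the cell (1,1) and (2,1) counts, and so on). The case of an empty cell (1,1) is rejected because those closed forms divide by its count; how the authors handled that case is not stated in the text.

```python
def corrected_cells(n1, n2, n3, n4):
    """Guessing probability p and true frequencies (v1, v2, v3, v4).

    Assumes n1 > 0, since every closed form below divides by n1.
    """
    if n1 <= 0:
        raise ValueError("closed forms are undefined when cell (1,1) is empty")
    p = n3 / (n1 + n3)                   # item-specific guessing probability
    v1 = (n1 + n3) ** 2 / n1             # students who did not learn
    v2 = (n2 - n3) * (n1 + n3) / n1      # students who learned
    v3 = 0.0                             # no forgetting, by assumption
    v4 = n4 - n2 * n3 / n1               # students who always knew
    return p, (v1, v2, v3, v4)


def esi_corrected(n1, n2, n3, n4):
    """ESI corrected for guessing: (v2 - v1) / N."""
    _, (v1, v2, _, _) = corrected_cells(n1, n2, n3, n4)
    return (v2 - v1) / (n1 + n2 + n3 + n4)


# Item 1 of the 7-item test (Table 4): observed cells 10, 9, 9, 22.
p, v = corrected_cells(10, 9, 9, 22)
print(round(p, 3), [round(x, 1) for x in v])   # 0.474 [36.1, 0.0, 0.0, 13.9]
print(round(esi_corrected(10, 9, 9, 22), 3))   # -0.722
```

The true frequencies for this item agree with the parenthesized values shown for item 1 in Table 4, and they sum to the 50 students tested, as the derivation requires.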

Parallels to Traditional Indicators 

The internal and external sensitivity indices have many similarities to 
traditional item statistics. First, both sensitivity indices range from -1 
to 1 as do the item discrimination index (top 27% - bottom 27%) and the cor- 
relation coefficients. 

Second, the categorical distribution underlying the sensitivity indices
is structurally similar to the reliability coefficient (in that it can be
viewed as the fraction of true outcomes to total outcomes for a particular
definition of desirable performance). Students falling in the fail-pass (FP)
and fail-fail (FF) categories (i.e., who fail the pretest) have scores that
can be influenced by instruction; these sources of score distribution can be
compared with true score variation. Students falling in the pass-fail (PF)
and pass-pass (PP) categories (i.e., who are masters prior to instruction)
cannot be influenced by instruction; these sources of score distribution can
be compared with error variation. Finally, the FP, FF, PF, and PP categories
represent all possibilities for score distribution and can be compared with
total score variation. Therefore, $\frac{FP + FF}{N}$, the proportion of score
distribution that can be instructionally influenced, parallels the proportion
of true-to-total score variation, that is, the reliability coefficient.

Finally, if one were to search for specific parallels to the ISI and ESI
among traditional indices, the point biserial discrimination index and the phi
coefficient respectively seem the most appropriate candidates. Both the ISI 
and point biserial measure the extent to which an item performs in concert with 
the total test. In the same fashion both the ESI and phi coefficient (between 
two items) measure the extent to which two items share similar response pat- 
terns. Computationally, the ESI can be thought of as a phi coefficient be- 
tween item i on the pretest and item i on the posttest. 
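
For comparison with the ESI, a phi coefficient for the pretest-by-posttest table of a single item can be sketched as below. The function name is hypothetical and the formula is the standard phi for a 2 x 2 table laid out as in Table 2; the example counts are those listed for item 1 of the 7-item test in Table 4. The sketch only shows how such a phi would be computed and makes no claim about how the authors obtained their reported values.

```python
import math


def phi_coefficient(n1, n2, n3, n4):
    """Phi for a 2 x 2 table with rows = pretest (fail, pass) and
    columns = posttest (fail, pass), i.e. cells (1,1), (1,2), (2,1), (2,2)."""
    denominator = math.sqrt((n1 + n2) * (n3 + n4) * (n1 + n3) * (n2 + n4))
    if denominator == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (n1 * n4 - n2 * n3) / denominator


# Item 1 of the 7-item test (Table 4): observed cells 10, 9, 9, 22.
print(round(phi_coefficient(10, 9, 9, 22), 3))  # 0.236
```

Phi rewards agreement between pretest and posttest responses (cells (1,1) and (2,2)), whereas the ESI numerator contrasts cells (1,2) and (1,1), so the two statistics need not take similar values for the same table.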

DATA APPLICATION 

The ISI and ESI and two traditional indices (phi and point biserial) were
computed using two sets of test data. The first was a 7-item multiple-choice
test measuring knowledge of Campbell and Stanley's research designs. This
test, designed by the authors, was administered to their graduate level in-
troductory statistics courses prior to and after instruction. The second
data source was a 70-item multiple-choice test administered to 115 students
before and after they received a ninth-grade mathematics program. The test
used for this purpose was developed by a school district and was designed to
assess student performance on those objectives that the district considered
to be most important at that grade level. For both tests a score above the
test mean was considered to indicate mastery. (These levels reflect the test
developers' suggestions; Harris (1974) has presented some guidance for estab-
lishing mastery levels.)
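
The mastery rule just described (a total score above the test mean) is what turns raw scores into the Table 1 cells. The sketch below shows one way this could be done. It is an assumption-laden illustration: the function names are hypothetical, and the text does not say whether the pretest and posttest each use their own mean, so the sketch lets the caller pass an explicit cutoff and otherwise falls back to the mean of the scores it is given.

```python
def mastery_flags(total_scores, cutoff=None):
    """True for each student whose total score exceeds the cutoff.

    If no cutoff is given, the mean of total_scores is used (an assumption).
    """
    if cutoff is None:
        cutoff = sum(total_scores) / len(total_scores)
    return [score > cutoff for score in total_scores]


def isi_cells(post_item, pre_total, post_total):
    """Table 1 cells (n1, n2, n3, n4) for one item.

    Only students who answered the item correctly on the posttest are counted.
    """
    pre_master = mastery_flags(pre_total)
    post_master = mastery_flags(post_total)
    cells = [0, 0, 0, 0]
    for correct, before, after in zip(post_item, pre_master, post_master):
        if correct:
            # index 0: fail/fail, 1: fail/pass, 2: pass/fail, 3: pass/pass
            cells[2 * before + after] += 1
    return tuple(cells)


# Made-up data for four students.
pre_total = [1, 2, 5, 6]    # mean 3.5, so the last two are pretest masters
post_total = [2, 6, 3, 7]   # mean 4.5, so the second and fourth are posttest masters
print(isi_cells([1, 1, 1, 1], pre_total, post_total))  # (1, 1, 1, 1)
print(isi_cells([0, 1, 1, 1], pre_total, post_total))  # (0, 1, 1, 1)
```

Because the cutoff is the mean, roughly half the students will be classified as masters on each occasion, a point worth keeping in mind when reading the cell frequencies reported below.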

The results of these efforts are displayed in Tables 3 through 8. Be-
cause of the manageable number of items in the first 7-item test, a complete
listing of intermediate results is provided for this measure in Tables 3 through
6. Table 3 presents the item response patterns for the computation of the ISI.
Each 2x2 matrix is analogous to Table 1; the numbers inside each cell are the
n's for a given item. Similarly, Table 4 presents the analogue of Table 2,
giving both the n's and v's required for the computation of the ESI and ESI*.
In Table 5, the values of the relevant statistics are displayed for each item.

A review of the values for the various indices reveals that the values 
of the ESI (both corrected and uncorrected for guessing) are quite low and that 
the ESI corrected for guessing is generally lower than its non-corrected coun- 
terpart. The ISI values are typically higher than the ESI and tend to parallel 
the point biserial and phi coefficients. 

On the whole, the average sensitivity indices are quite low, suggesting
at first glance that the test items were not particularly sensitive to instruc-
tion. However, upon a second, more careful inspection of the data, and
specifically the response patterns in Tables 3 and 4, an alternate explanation
emerges. We note that many students were masters prior to instruction, as ev-
idenced by the sizable frequencies in cells (2,2) (i.e., the values of $n_4$ were
large). Frequently, as many as half the students demonstrated mastery of the
materials at the pretest. Consequently, even though the difference between
cells (1,1) and (1,2) was considerable (i.e., item i discriminated among




TABLE 3

Item-Response Patterns for Computation of
Internal Sensitivity Index / 7-item test*

          Total pretest: fail              Total pretest: pass
          Posttest fail   Posttest pass    Posttest fail   Posttest pass
Item      (1,1)           (1,2)            (2,1)           (2,2)

  1          1              11                0              19
  2          0              15                0              23
  3          2              21                1              26
  4          2              21                1              26
  5          0              19                1              25
  6          1              20                0              26
  7          0              19                1              25

*Cells contain number of students passing each item on the posttest






TABLE 4

Item-Response Patterns for Computation of External Sensitivity
Index, 7-item test (numbers in parentheses correspond to v's)

          Pretest incorrect                    Pretest correct
          Posttest          Posttest           Posttest        Posttest
Item      incorrect         correct            incorrect       correct        N
1         10 (36.1)          9 (0)              9 (0)          22 (13.9)     50
2          9 (16.0)         28 (?3.3)           3 (0)          10 (.67)      50
3          0 (0)             3 (3)              0 (0)          47 (47)       50
4          0 (0)            16 [illegible]      0 (0)          34 (34)       50
5         [illegible] (12.5) 6 (7.5)            3 (0)          39 (30)
6          2 [illegible]     4 (2.5)            1 (0)          16 (.50)      50
7          2 (12.5)         31 (45)             1 (0)          16 (.50)      50

TABLE 5

[Values of the relevant statistics for each item of the 7-item test. The
body of this table is illegible in the source scan.]

TABLE 6

Alternate Sensitivity Indices
(Adjusted for Masters Prior to Instruction)

              ISI       ESI*       ESI
Item 1        .83      -1.00      -.05
Item 2       1.00        .41       .51
Item 3        .83       1.00      1.00
Item 4        .83       1.00      1.00
Item 5       1.00       -.25       .50
Item 6        .90        .43       .88
Item 7       1.00        .67       .33







TABLE 7

Summary Results for 70-item Test

                                            Mean       SD        N
Pretest                                    15.61     7.18      115
Posttest                                   31.08    11.90      115
ESI*                                        -.40      .30       70
ESI                                         -.18      .28       70
ISI                                          .12      .22       70
PHI (item with pass/fail on posttest)        .31      .16       70
P-BIS (item with posttest score)             .36      .16       70






TABLE 8

Correlations between Traditional and Sensitivity Indices,
70-item test

           ISI       ESI*      ESI       PHI       PBIS
ISI       1.00      -.07      -.22*      .83       .82**
ESI*                1.00       .88**     .34**     .32**
ESI                           1.00       .23*      .21*
PHI                                     1.00       .97**
PBIS                                              1.00






learners and non-learners), the large frequencies in cells (2,2) tended to
reduce this effect.
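
The masking effect described above can be illustrated with a short sketch. It rests on an assumption that this section does not restate: that the unadjusted index has the form (n2 - n1)/(n1 + n2 + n3 + n4), where n1 through n4 are the counts in cells (1,1), (1,2), (2,1) and (2,2). The function names are hypothetical; the counts are those of item 1 in Table 3.

```python
def index_all_cells(n1, n2, n3, n4):
    """ASSUMED unadjusted form: every student counts in the denominator."""
    return (n2 - n1) / (n1 + n2 + n3 + n4)


def index_nonmasters_only(n1, n2):
    """Same numerator, but only pretest non-masters count in the denominator."""
    return (n2 - n1) / (n1 + n2)


# Item 1, Table 3: cells (1,1), (1,2), (2,1), (2,2)
n1, n2, n3, n4 = 1, 11, 0, 19

full = index_all_cells(n1, n2, n3, n4)
restricted = index_nonmasters_only(n1, n2)
print(f"all cells: {full:.2f}, non-masters only: {restricted:.2f}")
# prints "all cells: 0.32, non-masters only: 0.83"
```

With 19 of 31 students already in cell (2,2), the same 10-student difference between cells (1,2) and (1,1) yields a much smaller value under the assumed four-cell denominator than under the two-cell one.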

In order to detect item sensitivity in this situation, an alternate form
of the indices was utilized in which the scores of students demonstrating
mastery at the pretest were not taken into account in solving for the
sensitivity indices. In computational terms, the values of n_3 and n_4 were
removed from the denominator, and the formulae for these alternate indices
became:*

\[ \mathrm{ISI} = \frac{n_2 - n_1}{n_1 + n_2} \qquad (n_1, n_2 \text{ defined in Table 1}) \tag{14} \]
\[ \text{(uncorrected) } \mathrm{ESI} = \frac{n_2 - n_1}{n_1 + n_2} \qquad (n_1, n_2 \text{ defined in Table 2}) \tag{15} \]
\[ \text{(corrected) } \mathrm{ESI}^{*} = \frac{v_2 - v_1}{v_1 + v_2} \qquad (v_1, v_2 \text{ defined in (8), (9)}) \tag{16} \]

These values are presented in Table 6. The consistently high values for the
alternate ISI confirm our suspicion that the item indices were artificially
deflated by a high proportion of prior masters and that the items were indeed
sensitive to instruction. On the other hand, the greatly varying values for
the ESI tend to reduce our confidence in this statistic.
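
As a check on formulae (14) through (16), the alternate indices can be recomputed from the cell counts. The sketch below is illustrative only. The helper name `alternate_index` is hypothetical; the (n1, n2) pairs are the pretest-fail cells as read from Table 3, where the assignment of matrices to items follows the reading order of the scanned table and is an assumption; the item 1 values of n and v are those read from Table 4.

```python
def alternate_index(low, high):
    """(high - low) / (low + high), the common form of (14), (15) and (16).

    Returns None when both counts are zero, since the ratio is undefined.
    """
    total = low + high
    if total == 0:
        return None
    return (high - low) / total


# (n1, n2) per item: cells (1,1) and (1,2) as read from Table 3 (assumed order)
table3_pretest_fail = {
    1: (1, 11), 2: (0, 15), 3: (2, 21), 4: (2, 21),
    5: (0, 19), 6: (1, 20), 7: (0, 19),
}

alt_isi = {
    item: round(alternate_index(n1, n2), 2)
    for item, (n1, n2) in table3_pretest_fail.items()
}

# Item 1, Table 4: n's (10, 9) for the ESI and v's (36.1, 0) for the ESI*
alt_esi_item1 = round(alternate_index(10, 9), 2)
alt_esi_star_item1 = round(alternate_index(36.1, 0), 2)

print(alt_isi)
print(alt_esi_item1, alt_esi_star_item1)
```

Under these assumptions the rounded ISI values are .83, 1.00, .83, .83, 1.00, .90 and 1.00 for items 1 through 7, and item 1 gives ESI = -.05 and ESI* = -1.00, in agreement with the entries of Table 6.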

Inspection of Table 7 reveals a similar pattern in the 70-item exam.
The values of the ESI and ESI* are quite low and vary considerably, while the
ISI values are higher, more consistent, and tend to parallel the values for the
phi and point biserial coefficients.

In Table 8, correlations between the various indices are presented for
the 70-item test. (Correlations could not be computed for the 7-item test,
as N = 7.) The ISI was significantly correlated with both the point biserial
and the phi coefficient. It would appear, then, that these three indices would
tend to select many of the same items as "good" or "bad." In contrast, the
correlations of the ESI with the phi and point biserial, although significant,
were rather small, suggesting that this index would not give the same judgment
of an item as the traditional statistics. Apparently, considering an item
independently of the total test leads to very different results than viewing
an item in terms of total test performance. Perhaps an item considered as a
single, independent measure is not powerful and/or stable enough to
discriminate among those who have and have not profited from instruction.

*The use of partial data is not new to psychometrics. Harris (1974),
for example, also considers selected data in his discussion of technical
characteristics of mastery tests.
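
The comparison reported in Table 8 rests on correlating one index with another across items. A minimal sketch of such a computation follows, assuming the ordinary Pearson product-moment correlation was used (the section does not name the coefficient). The function name and the per-item values are hypothetical and are not the paper's data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical index values, one entry per item (NOT taken from the paper)
isi_values = [0.10, 0.25, 0.05, 0.40, 0.30]
phi_values = [0.20, 0.35, 0.15, 0.45, 0.38]

r = pearson_r(isi_values, phi_values)
print(f"r(ISI, PHI) = {r:.2f}")
```

A high positive r between two indices means they rank the items similarly and would tend to flag the same items; a value near zero means the two indices give largely unrelated judgments.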

CONCLUSIONS 

Two types of sensitivity indices were developed in this paper, one internal
to the total test and the second external. To evaluate the success of
these statistics we considered the three criteria suggested for a satisfactory
index of item quality. The ISI appears to meet these demands. Certainly it
is easily computed. In addition, its moderately positive correlations with
other traditional statistics confirm that the ISI provides unique information
and yet is not inconsistent with these indices. However, when there are
a large number of masters at the pretest, an alternate form of the ISI is
sometimes necessary to demonstrate item sensitivity. Finally, the theoretical
construction of the ISI is both intuitively understandable and similar in form
to other statistics. The ESI, on the other hand, does not fare as well as its
internal counterpart. Although computationally simple, it fails to demonstrate
any consistent correlations with the traditional indices, suggesting a rather
random statistic. Perhaps a single item (or an item viewed independently of
the total test) is not sufficient to provide a stable, reliable measure of the
effects of instruction.

In summary, the ISI appears to provide a suitable measure of an item's
ability to distinguish between those who have and have not benefited from
instruction. Further, the most appropriate approach for evaluating item
quality is an examination of the item in context with total test performance.




References 

Campbell, D. T., & Stanley, J. C. Experimental and quasi-experimental
designs for research. Chicago: Rand McNally, 1966.

Cox, R. C., & Vargas, J. S. A comparison of item selection techniques for
norm-referenced and criterion-referenced tests. Paper presented at
the annual meeting of the National Council on Measurement in Education,
Chicago, 1966.

Harris, C. W. Some technical characteristics of mastery tests. In C. W.
Harris, M. C. Alkin, & W. J. Popham (Eds.), Problems in criterion-
referenced measurement. CSE Monograph Series in Evaluation, No. 3.
Los Angeles: Center for the Study of Evaluation, University of
California, 1974.

Marks, E., & Noll, G. A. Procedures and criteria for evaluating reading
and listening comprehension tests. Educational and Psychological
Measurement, 1967, 27, 335-348.

Ozenne, D. G. Toward an evaluation methodology for criterion-referenced
measures: Test sensitivity. CSE Report No. 72. Los Angeles: Center
for the Study of Evaluation, University of California, Oct., 1971.

Popham, W. J. Indices of adequacy for criterion-referenced test items.
A symposium presentation at the joint session of the National Council
for Measurement in Education and the American Educational Research
Association, New Orleans, 1973.

Roudabush, G. E. Item selection for criterion-referenced tests. Paper
presented at the annual meeting of the American Educational Research
Association, New Orleans, 1973.






The work upon which this publication is based was performed pursuant to
a contract with the National Institute of Education, Department of Health,
Education and Welfare. Points of view or opinions stated do not necessarily
represent official NIE position or policy.






